Learn how Algolia handles tokenization.
To form a token, we consider a string character-by-character. Our tokenizer identifies the longest groups of contiguous characters belonging to the same class (separator or non-separator), and creates a token for each group.
For example, the string Hello, World! is tokenized as these four tokens:
Hello (non-separator)
, (with a trailing space) (separator)
World (non-separator)
! (separator)
Hello and World are composed of non-separator characters, while , (with a trailing space) and ! are composed of separators.
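The grouping rule described above can be sketched in a few lines of Python. This is only an illustration, not Algolia's implementation: the function names `is_separator` and `tokenize` are hypothetical, and the sketch assumes that any non-alphanumeric character counts as a separator.

```python
from itertools import groupby


def is_separator(ch: str) -> bool:
    # Assumption for this sketch: every non-alphanumeric character is a separator.
    return not ch.isalnum()


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split text into the longest runs of characters of the same class."""
    tokens = []
    # groupby starts a new group each time the character class changes.
    for sep, group in groupby(text, key=is_separator):
        label = "separator" if sep else "non-separator"
        tokens.append(("".join(group), label))
    return tokens


print(tokenize("Hello, World!"))
# [('Hello', 'non-separator'), (', ', 'separator'), ('World', 'non-separator'), ('!', 'separator')]
```

Because a group ends only when the class changes, the comma and the space that follows it land in a single separator token.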
Only non-separator characters are indexed, and thus searchable, by default. In the example above, only Hello and World are indexed. Whether a user searches for Hello, World! or hello world, any record with these tokens will be a match.
To make separator characters searchable, add them to the separatorsToIndex parameter. When you include a character in this parameter, it has three consequences:
For example, if separatorsToIndex is set to #@ (hash and at sign), then the string #@lgolia!! is tokenized as:
# (non-separator)
@ (non-separator)
lgolia (non-separator)
!! (separator)
Because # and @ are included in separatorsToIndex, we index the tokens #, @, and lgolia. Note that even though they appear next to each other, # and @ are separate tokens.
Now, when a user searches for #, @, or LGOLIA!!, this record will be a match.
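As a rough sketch of this behavior, the Python below treats every character listed in a `separators_to_index` argument as its own indexed token. The helper names (`tokenize`, `searchable_tokens`) are hypothetical and this is not Algolia's code. The sketch assumes that non-alphanumeric characters are separators by default and that matching is case-insensitive, as the LGOLIA!! example suggests.

```python
def tokenize(text: str, separators_to_index: str = "") -> list[tuple[str, bool]]:
    """Return (token, indexed) pairs for text."""
    tokens: list[tuple[str, bool]] = []
    run = ""             # current run of same-class characters
    run_indexed = False  # whether the current run is made of indexed characters
    for ch in text:
        if ch in separators_to_index:
            # An indexed separator closes the current run and forms its own token.
            if run:
                tokens.append((run, run_indexed))
                run = ""
            tokens.append((ch, True))
            continue
        indexed = ch.isalnum()  # assumption: alphanumeric means non-separator
        if run and indexed != run_indexed:
            tokens.append((run, run_indexed))
            run = ""
        run += ch
        run_indexed = indexed
    if run:
        tokens.append((run, run_indexed))
    return tokens


def searchable_tokens(text: str, separators_to_index: str = "") -> list[str]:
    # Assumption: tokens are compared case-insensitively.
    return [tok.lower() for tok, indexed in tokenize(text, separators_to_index) if indexed]


print(tokenize("#@lgolia!!", "#@"))
# [('#', True), ('@', True), ('lgolia', True), ('!!', False)]
print(searchable_tokens("LGOLIA!!", "#@"))
# ['lgolia']
```

With an empty `separators_to_index`, the same function reproduces the default behavior: `searchable_tokens("Hello, World!")` keeps only `hello` and `world`.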
Although each character in separatorsToIndex is indexed on its own, when it’s directly adjacent to a non-separator token, we want to keep the order as a requirement for the search query.
For example, assuming that @ is included in separatorsToIndex, the string alice@wonderland is interpreted as alice @ wonderland (all tokens must be adjacent, in this order). The phrase alice @ wonderland (with spaces in between) has the same tokens, but with no restrictions on order. A search for alice@wonderland returns records with alice@wonderland and alice @ wonderland (with spaces), but not records with wonderland @ alice or alice was @ wonderland.
When tokens must be found in a particular order, it is known as a sequence expression.
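A sequence expression can be illustrated with a small Python sketch. It is a simplification, not Algolia's matching engine: the helpers `searchable_tokens`, `contains_sequence`, and `matches` are hypothetical. The sketch assumes that each whitespace-delimited chunk of the query with several tokens forms one sequence expression, and that matching is case-insensitive.

```python
import re


def searchable_tokens(text: str, separators_to_index: str = "@") -> list[str]:
    # Alphanumeric runs, plus each indexed separator as its own token.
    pattern = r"[^\W_]+"
    if separators_to_index:
        pattern += "|[" + re.escape(separators_to_index) + "]"
    return [tok.lower() for tok in re.findall(pattern, text)]


def contains_sequence(record_tokens: list[str], sequence: list[str]) -> bool:
    """True if sequence appears in record_tokens as adjacent tokens, in order."""
    n = len(sequence)
    return any(
        record_tokens[i:i + n] == sequence
        for i in range(len(record_tokens) - n + 1)
    )


def matches(query: str, record: str, separators_to_index: str = "@") -> bool:
    record_tokens = searchable_tokens(record, separators_to_index)
    for chunk in query.split():
        sequence = searchable_tokens(chunk, separators_to_index)
        # A single-token chunk only needs to be present somewhere in the record.
        if sequence and not contains_sequence(record_tokens, sequence):
            return False
    return True


for record in ["alice@wonderland", "alice @ wonderland",
               "wonderland @ alice", "alice was @ wonderland"]:
    print(record, "->", matches("alice@wonderland", record))
```

In this sketch the query alice @ wonderland (with spaces) splits into three single-token chunks, so it also matches wonderland @ alice, which reflects the absence of ordering constraints described above.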
We also always create sequence expressions when alphanumeric characters surround a hyphen (-), whether the hyphen is in separatorsToIndex or not. For example, the term real-time creates a sequence expression, meaning that the query real-time matches records with real time and real-time, but not real [...] time, time real, or time [...] real ([...] indicates other words in the string). The query real time, without a hyphen, matches any records with those two words, no matter the order or proximity.
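The hyphen rule can be sketched the same way. The Python below is an assumption-laden illustration rather than Algolia's parser: `parse_query`, `words`, and `matches` are hypothetical names, the hyphen is assumed not to be in separatorsToIndex (so it is not itself a token), and matching is assumed to be case-insensitive.

```python
import re

WORD = r"[^\W_]+"  # assumption: a word is a run of alphanumeric characters


def words(text: str) -> list[str]:
    return [w.lower() for w in re.findall(WORD, text)]


def parse_query(query: str) -> list[list[str]]:
    """Return expressions; an expression with several tokens is a sequence expression."""
    expressions = []
    for chunk in query.split():
        if re.fullmatch(rf"{WORD}(?:-{WORD})+", chunk):
            # Alphanumerics around hyphens: tokens must be adjacent and in order.
            expressions.append(words(chunk))
        else:
            # Otherwise every word is an independent, unordered expression.
            expressions.extend([w] for w in words(chunk))
    return expressions


def matches(query: str, record: str) -> bool:
    tokens = words(record)

    def contains(seq: list[str]) -> bool:
        n = len(seq)
        return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

    return all(contains(expr) for expr in parse_query(query))


print(parse_query("real-time"))  # [['real', 'time']]
print(parse_query("real time"))  # [['real'], ['time']]
print(matches("real-time", "time real"))  # False
print(matches("real time", "time real"))  # True
```

Since the record real-time tokenizes to the same two adjacent words as real time, the sequence expression matches both, while records with intervening words or reversed order are rejected.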
real-time
to real time